
Chunk Data Model supports per-chunk service event mapping #6744

Merged
merged 33 commits into feature/efm-recovery from jord/6622-chunk-service-events on Dec 11, 2024

Conversation

@jordanschalm (Member) commented Nov 20, 2024

This PR adds support for specifying which service events were emitted in which chunk, by modifying the ChunkBody data structure in a backward-compatible manner. Addresses #6622.

Changes

  • Adds ServiceEventCount field to ChunkBody:
    • This field creates an explicit mapping, committed to by Execution Nodes, of service events to chunks. This allows Verification Nodes to know which service events to expect when validating any chunk.
    • This field is defined to be backward-compatible with prior data model versions:
      • Existing serializations of ChunkBody will unmarshal into a struct with a nil ServiceEventCount. We define a chunk with nil ServiceEventCount to have the same semantics as before the field existed: if any service events were emitted, then they were emitted from the system chunk.
      • Post software upgrade, all new (honest) serializations of ChunkBody will always have a non-nil ServiceEventCount (a minimal sketch of both interpretations follows below).
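To make the two representations concrete, below is a minimal sketch of how a consumer might attribute service events to chunks under both conventions. All type and helper names here are illustrative assumptions, not the actual flow-go API:

```go
package main

import "fmt"

// Sketch only: illustrative type, not the actual flow-go ChunkBody.
type ChunkBody struct {
	// nil     => v0 semantics: any service events belong to the system chunk.
	// non-nil => number of service events emitted by this chunk.
	ServiceEventCount *uint16
}

// serviceEventsPerChunk returns, for each chunk, how many service events it emitted.
func serviceEventsPerChunk(chunks []ChunkBody, totalServiceEvents int) []int {
	counts := make([]int, len(chunks))
	if len(chunks) > 0 && chunks[0].ServiceEventCount == nil {
		// v0 semantics: attribute all service events to the system (last) chunk.
		counts[len(chunks)-1] = totalServiceEvents
		return counts
	}
	for i, c := range chunks {
		counts[i] = int(*c.ServiceEventCount)
	}
	return counts
}

func main() {
	one, zero := uint16(1), uint16(0)
	fmt.Println(serviceEventsPerChunk([]ChunkBody{{nil}, {nil}}, 2))    // [0 2] (v0 semantics)
	fmt.Println(serviceEventsPerChunk([]ChunkBody{{&one}, {&zero}}, 1)) // [1 0] (v1 semantics)
}
```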

Upgrade Notes

Source of truth for upgrade plans (still WIP): https://flowfoundation.notion.site/EFM-Recovery-Release-Upgrade-Plan-WIP-14d1aee1232480228a87e43933815285?pvs=4

Note: Implementation changes associated with the upgrade process will be implemented separately, when the upgrade process is fully specified (see #6777).

#6783 captures changes required around the upgrade behaviour.

To Do Before Merging

  • Ensure consistent hashing between ExecutionResult versions
  • Update description in Remove ChunkBody backward-compatibility #6773
  • Check if necessary to update RPC model (rpc conversion tests fail with non-nil ServiceEventCount field)
  • Should this directly target master / the current spork branch, as this HCU will occur before EFM Recovery? (See upgrade plan)

This PR replaces two prior approaches, partially implemented in #6629 and #6730.

@codecov-commenter commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 70.75472% with 31 lines in your changes missing coverage. Please review.

Project coverage is 41.72%. Comparing base (7c71c41) to head (9a10320).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| utils/slices/slices.go | 0.00% | 9 Missing ⚠️ |
| utils/unittest/encoding.go | 0.00% | 7 Missing ⚠️ |
| engine/execution/block_result.go | 44.44% | 4 Missing and 1 partial ⚠️ |
| utils/unittest/fixtures.go | 0.00% | 4 Missing ⚠️ |
| model/flow/chunk.go | 94.23% | 2 Missing and 1 partial ⚠️ |
| module/chunks/chunkVerifier.go | 57.14% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@                  Coverage Diff                  @@
##           feature/efm-recovery    #6744   +/-   ##
=====================================================
  Coverage                 41.71%   41.72%
=====================================================
  Files                      2030     2031    +1
  Lines                    180459   180552   +93
=====================================================
+ Hits                      75285    75329   +44
- Misses                    98978    99031   +53
+ Partials                   6196     6192    -4
```
| Flag | Coverage Δ |
|---|---|
| unittests | 41.72% <70.75%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.


@jordanschalm changed the title from "DRAFT: Chunk Data Model supports service event indices" to "Chunk Data Model supports service event indices" on Nov 21, 2024
@AlexHentschel (Member) commented:
strategic / conceptual thoughts

  1. I am thinking about enforcing activation of the protocol extension (specifically, the usage of the Chunk.ServiceEventIndices field). For example, it would be nice if consensus nodes could drop Execution Results with the deprecated format, avoid incorporating them into a block, and reject blocks that still do.

  2. To ensure consensus incorporates only execution receipts following the new convention after a certain height, it would be great if we could also include some consistency check in the receiptValidator (somewhere around here).

  3. I was wondering if your plan is still to remove the Chunk.ServiceEventIndices field for Byzantine Fault Tolerance in the long term? I think we had talked about turning ExecutionResult.ServiceEventList into an indexed list. Or are you thinking about keeping the ServiceEventIndices field as the long-term solution, just with the backwards-compatibility case removed?

    In general, my preference would be to also allow Chunk.ServiceEventIndices = nil when there are no service events generated in the chunk. Thereby the final solution becomes a lot more intuitive:

    • if ExecutionResult.ServiceEvents is not empty (nil allowed), then the new convention requires for consistency:

      $$\sum_{\texttt{Chunk}} \texttt{len}(\texttt{Chunk.ServiceEventIndices}) = \texttt{len}(\texttt{ExecutionResult.ServiceEventList})\qquad\qquad\qquad\qquad(1)$$

      I feel this check is very similar to other properties that consensus nodes already verify about an Execution Receipt (see the ReceiptValidator implementation).

    • We would temporarily relax the new convention, eq. $(1)$, for backwards compatibility as follows: the ServiceEventIndices fields of all chunks can be nil despite there being service events.

    • As service events are rare, ExecutionResult.ServiceEventList is empty in the majority of cases. Then both the deprecated and the new convention would allow ChunkBody.ServiceEventIndices to be nil (which is the most intuitive convention anyway). Also, for individual chunks that don't produce any service events, their ChunkBody.ServiceEventIndices could be nil or empty according to the new convention. So the new convention is very self-consistent in my opinion, and the deprecation condition is only an add-on that can be removed later.

That said, if this is only a temporary solution, I am happy and we can skip most of question 3.

@jordanschalm (Member, Author) commented Nov 25, 2024

Summary of Discussion with @AlexHentschel

Change of ServiceEventIndices structure

  • Replaces the list with ServiceEventsNumber, a uint16 counting the number of service events emitted in that chunk:
    • Since VNs have access to all chunks, they can easily compute the index range based on this field alone
    • This is a more compact, and fixed-size representation
    • Structural validation is simpler: we require only that sum(chunk.ServiceEventsNumber for chunk in chunks) == len(ServiceEvents) (a sketch of this check follows after this list)
  • Backward compatibility:
    • If any service events are emitted in a result, and all ServiceEventsNumber fields are 0, then this is interpreted as a v0 model version: all service events must have been emitted in the system chunk.
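A minimal sketch of this structural validation, including the v0 fallback described above. The type and function names are illustrative, not the actual flow-go ReceiptValidator code:

```go
package main

import (
	"errors"
	"fmt"
)

// Sketch only: illustrative type, not the actual flow-go chunk model.
type Chunk struct {
	ServiceEventsNumber uint16 // count of service events emitted in this chunk
}

// validateServiceEventCounts checks that the per-chunk counts are consistent
// with the result-level service event list. All-zero counts alongside a
// non-empty event list are interpreted as the v0 model (all events emitted
// in the system chunk), per the backward-compatibility rule above.
func validateServiceEventCounts(chunks []Chunk, numServiceEvents int) error {
	sum := 0
	for _, c := range chunks {
		sum += int(c.ServiceEventsNumber)
	}
	if sum == numServiceEvents {
		return nil // new convention satisfied
	}
	if sum == 0 && numServiceEvents > 0 {
		return nil // v0 semantics: all events attributed to the system chunk
	}
	return errors.New("chunk service event counts inconsistent with execution result")
}

func main() {
	fmt.Println(validateServiceEventCounts([]Chunk{{0}, {2}}, 2)) // <nil>
	fmt.Println(validateServiceEventCounts([]Chunk{{0}, {0}}, 2)) // <nil> (v0 fallback)
	fmt.Println(validateServiceEventCounts([]Chunk{{1}, {0}}, 2)) // error
}
```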

Removing backward-compatibility at next spork

We plan to keep the overall structure as the long term solution, and only remove the backward-compatibility support at the next spork. We do not plan to use a different representation (ie. Chunk.ServiceEventIndices field).

Upgrade Comments

  • Protocol HCUs are triggered by a ProtocolVersionUpgrade service event. This service event is emitted outside the system chunk, meaning that the first Protocol HCU must take place after the service event validation fix has been deployed.
  • We plan to incorporate all changes under feature/efm-recovery in one Protocol HCU.

Rough Outline of Process

  1. Do a manual rolling upgrade of all node roles, to a version including feature/efm-recovery.
  2. Manually verify that all ENs are upgraded (this is the only correctness-critical step!)
    • this is necessary prior to emitting the first service event outside the system chunk
    • ideally, VNs are also updated, but we can rely on emergency sealing as a fallback if necessary
  3. Emit ProtocolVersionUpgrade service event, scheduling the protocol upgrade at view V.
    • Nodes which are not upgraded when we enter view V will halt.
    • Before entering view V, we must have a substantial majority (>> supermajority) of SNs and LNs upgraded for the network to remain live (this is the only liveness-critical step!)

@jordanschalm changed the title from "Chunk Data Model supports service event indices" to "Chunk Data Model supports per-chunk service event mapping" on Dec 2, 2024
@jordanschalm (Member, Author) commented:
Summary of discussion with @zhangchiqing

Let's call v0 the old version and v1 the new version. Leo pointed out that v0 nodes will fail to validate data models produced by v1 nodes. In particular, this step:

> Do a manual rolling upgrade of all node roles, to a version including feature/efm-recovery.

will not work if v1 nodes immediately begin producing v1 chunks with a non-nil ServiceEventsNumber field.

This means the upgrade needs to be split into multiple steps, because v0 software will be unable to read v1 data models with non-nil new fields (they will produce different hashes). Instead, we need:

  1. Rolling upgrade from software version v0 to v1 (v1 still produces chunks with nil field)
  2. Protocol HCU, after which v0 nodes cannot progress. At this point, v1 nodes begin producing v1 chunk models with non-nil fields. Chunk models with nil fields are also still accepted (due to sealing lag).
  3. Protocol HCU, after which v0 data models are no longer accepted in new blocks

Because of the above additional complexity, I think we should revert the ProtocolStateVersionUpgrade service event to be emitted in the system chunk (at least to start), because we need Protocol HCUs to safely roll out the breaking chunk model change. After the upgrade is complete, all subsequent service events may be emitted outside the system chunk.

@zhangchiqing (Member) commented:
I think we can still do it with just one HCU, but we might need 2 rolling upgrades. These are the steps; basically, Steps 1 and 3 are rolling upgrades and Step 2 is a protocol HCU:

  1. Rolling upgrade from software version v0 to v1 (v1 still produces chunks with the nil field), making sure that:
  • a v1 EN will produce v1 results that have chunks with the nil field, so that v1 results can be accepted by both v0 and v1 SNs.
  • a v1 result with the nil field in its chunks produces the same result ID as the v0 result decoded from the v1 result.
  • a v1 block with a v1 result that has the nil field also produces the same block ID as the v0 block with the decoded v0 result, so that v1 blocks can be accepted by both v0 and v1 nodes.
  2. Protocol HCU, after which results with the nil field will not be accepted. At this point, v1 ENs begin producing v1 chunk models with non-nil fields for blocks above the HCU height. Details, once the protocol HCU event is emitted (see the sketch after this list):
  • When a v1 SN receives results, it will reject results with the nil field for blocks above the HCU height.
  • When a v1 SN builds blocks for heights above the HCU height, it will include only v1 results with non-nil fields. Note that, due to sealing lag, a v1 SN might build some blocks that still contain results with the nil field, because those are for blocks below the HCU height.
  • When any v1 node receives a block, it will reject any block that contains a result above the HCU height with the nil field in its chunks.
  • Blocks produced by a v0 SN for heights above the HCU won't be accepted by v1 SNs.
  • A v0 SN cannot progress after the HCU because it won't accept blocks from v1 SNs (the block hash doesn't match).
  3. Rolling upgrade, after which v0 results, and v1 results with the nil field, will not be accepted in new blocks.
  • Once the HCU height from step 2 has been sealed and all ENs are on v1, we can roll out this upgrade, because we are sure that no results in new blocks will have the nil field in their chunks.
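A rough sketch of the height-gated acceptance rule from step 2. The function and parameter names are hypothetical; the real check would live in flow-go's receipt validation, which this does not reproduce:

```go
package main

import "fmt"

// acceptChunkFormat sketches step 2's rule: results for blocks above the HCU
// height must use the v1 chunk format (non-nil ServiceEventCount), while
// results for blocks at or below it may still use the v0 format (sealing lag).
func acceptChunkFormat(serviceEventCount *uint16, blockHeight, hcuHeight uint64) error {
	if blockHeight > hcuHeight && serviceEventCount == nil {
		return fmt.Errorf("v0 chunk format rejected at height %d (HCU at %d)", blockHeight, hcuHeight)
	}
	return nil
}

func main() {
	count := uint16(1)
	fmt.Println(acceptChunkFormat(&count, 101, 100)) // <nil>: v1 format accepted
	fmt.Println(acceptChunkFormat(nil, 101, 100))    // error: v0 format rejected
	fmt.Println(acceptChunkFormat(nil, 99, 100))     // <nil>: v0 still allowed below HCU
}
```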

@jordanschalm (Member, Author) commented:
@zhangchiqing I agree that would work. Though, I suspect we will prefer fewer software upgrades, so that we depend less on actions by node operators; still, it's good to know that we have flexibility.

I have written a version of the upgrade plan based on our last discussion (with 2 HCUs and 1 rolling upgrade) here. The upgrade process is fairly simple; there is also an enumeration of versions, which I hope can be generalized beyond this specific example.

@AlexHentschel (Member) left a comment:
Looks great. I struggled with understanding the Chunk Verifier tests. I think we can fix this with some documentation. They are tests; it's fine if the documentation is repetitive.

model/flow/chunk.go (review thread resolved)
```go
// (2) Otherwise, ServiceEventCount must be non-nil.
// Within an ExecutionResult, all chunks must use either representation (1) or (2), not both.
ServiceEventCount *uint16
BlockID           Identifier // Block ID of the execution result this chunk belongs to
```
Member:
Do we maybe want to move this to the beginning of the ChunkBody? I think conceptually that would be more consistent.

Member Author:

I agree, but RLP encoding depends on field ordering within structs, so doing this would change the ID computation (unless we overrode it again, using the RLP encoding).
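As a small illustration of this field-ordering sensitivity, here is a sketch using go-ethereum's rlp package as a stand-in encoder (flow-go's actual ID computation differs in detail; the struct names are made up):

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/rlp"
)

// Two structs with identical fields declared in different order.
type FieldsAB struct {
	A uint64
	B string
}
type FieldsBA struct {
	B string
	A uint64
}

func main() {
	ab, _ := rlp.EncodeToBytes(FieldsAB{A: 1, B: "chunk"})
	ba, _ := rlp.EncodeToBytes(FieldsBA{B: "chunk", A: 1})
	// RLP encodes struct fields in declaration order, so the bytes (and hence
	// any hash/ID derived from them) differ even though the data is identical.
	fmt.Printf("%x\n%x\n", ab, ba)
}
```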

Member Author:

Added a test case to validate this: a61e30e

model/flow/chunk.go (additional review threads resolved)
module/chunks/chunkVerifier_test.go, comment on lines 292 to 335
```go
// Tests the case where a service event is emitted outside the system chunk
// and the event computed by the VN does not match the Result.
// NOTE: this test case relies on the ordering of transactions in generateCollection.
func (s *ChunkVerifierTestSuite) TestServiceEventsMismatch_NonSystemChunk() {
	script := "service event mismatch in non-system chunk"
	meta := s.GetTestSetup(s.T(), script, false, true)
	vch := meta.RefreshChunkData(s.T())

	// modify the list of service events produced by FVM
	// EpochSetup event is expected, but we emit EpochCommit here resulting in a chunk fault
	epochCommitServiceEvent, err := convert.ServiceEvent(testChain, epochCommitEvent)
	require.NoError(s.T(), err)

	s.snapshots[script] = &snapshot.ExecutionSnapshot{}
	s.outputs[script] = fvm.ProcedureOutput{
		ComputationUsed:        computationUsed,
		ConvertedServiceEvents: flow.ServiceEventList{*epochCommitServiceEvent},
		Events:                 meta.ChunkEvents[:3],
	}

	_, err = s.verifier.Verify(vch)

	assert.Error(s.T(), err)
	assert.True(s.T(), chunksmodels.IsChunkFaultError(err))
	assert.IsType(s.T(), &chunksmodels.CFInvalidServiceEventsEmitted{}, err)
}

// Tests that service events are checked, when they appear outside the system chunk.
// NOTE: this test case relies on the ordering of transactions in generateCollection.
func (s *ChunkVerifierTestSuite) TestServiceEventsAreChecked_NonSystemChunk() {
	script := "service event in non-system chunk"
	meta := s.GetTestSetup(s.T(), script, false, true)
	vch := meta.RefreshChunkData(s.T())

	// setup the verifier output to include the correct data for the service events
	output := generateDefaultOutput()
	output.ConvertedServiceEvents = meta.ServiceEvents
	output.Events = meta.ChunkEvents[:3] // 2 default events + 1 service event
	s.outputs[script] = output

	spockSecret, err := s.verifier.Verify(vch)
	assert.NoError(s.T(), err)
	assert.NotNil(s.T(), spockSecret)
}
```
Member:

I am struggling to convince myself that we really test the correct edge cases. In my mind, we are trying to test the following complementary aspects, with as few gaps as possible:

  1. Situation: a non-system chunk containing a service event (honest). Expected: pass. I think this is tested in TestServiceEventsAreChecked_NonSystemChunk.

  2. Exactly the same as situation 1, except that the ConvertedServiceEvents is different. Expected: chunk fault.

    I got confused here, because in test TestServiceEventsMismatch_NonSystemChunk too many lines of code are different, each of which could be a symptom of a chunk fault. For me, it would really help if TestServiceEventsMismatch_NonSystemChunk mirrored TestServiceEventsAreChecked_NonSystemChunk with as few changes as possible.

Member Author:

> For me, it would really help if TestServiceEventsMismatch_NonSystemChunk mirrored TestServiceEventsAreChecked_NonSystemChunk with as few changes as possible.

They are different because the existing testing infrastructure has very different code-paths for the system chunk and other chunks. Unfortunately I don't think it is feasible to make them more similar without a larger refactor of this test file.

module/chunks/chunkVerifier_test.go, comment on lines 305 to 310
```go
s.snapshots[script] = &snapshot.ExecutionSnapshot{}
s.outputs[script] = fvm.ProcedureOutput{
	ComputationUsed:        computationUsed,
	ConvertedServiceEvents: flow.ServiceEventList{*epochCommitServiceEvent},
	Events:                 meta.ChunkEvents[:3],
}
```
Member:

To me this seems significantly different from Specifically, I don

```go
// setup the verifier output to include the correct data for the service events
output := generateDefaultOutput()
output.ConvertedServiceEvents = meta.ServiceEvents
output.Events = meta.ChunkEvents[:3] // 2 default events + 1 service event
s.outputs[script] = output
```
and I don't understand why this needs to be. In the end, we want to confirm that the verifier catches it if only one detail is different from the honest execution.

@jordanschalm (Member Author) commented Dec 10, 2024

I think your comment got cut off. The portion of code you linked is constructing an expected output for a transaction in which a service event was emitted (outside the system chunk).

  • Line 328 is assigning the default service events for a non-system-chunk transaction as the expected output
  • Line 329 is pulling out the 3 events associated with the transaction and adding them to the expected output
  • Line 330 is inserting the expected output into the map, so the verifier will consider this the canonical output for the transaction

```go
// setup the verifier output to include the correct data for the service events
output := generateDefaultOutput()
output.ConvertedServiceEvents = meta.ServiceEvents
output.Events = meta.ChunkEvents[:3] // 2 default events + 1 service event
```
Member:

Why are we trimming meta.ChunkEvents here? The chunk events can be more than that, can't they?

The testing framework has lots of layers, which I am struggling with. Though, to the best of my limited understanding, honest chunk data should be consistent with the verifier's local output. The chunk data here is represented by meta and the derived vch. So to me, the following assignment would make sense:

```go
output.Events = meta.ChunkEvents
```

Member Author:

The output is the expected output for one transaction. The test framework adds 2 events (contents of eventsList) as the expected output for every transaction by default. If specified (new option in this PR), it will additionally add 1 service event (3 total).

module/chunks/chunkVerifier_test.go (review threads resolved)
@jordanschalm removed the request for review from durkmurder on December 9, 2024 22:01
@zhangchiqing (Member) left a comment:
Nice tests. LGTM

model/flow/chunk.go (review thread resolved)
@jordanschalm merged commit af44135 into feature/efm-recovery on Dec 11, 2024
55 checks passed
@jordanschalm deleted the jord/6622-chunk-service-events branch on December 11, 2024 20:35